{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "978146e2",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/openvino.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f717d3d4-942b-4d86-9435-fc44b3ac6d39",
   "metadata": {},
   "source": [
    "# OpenVINO LLMs\n",
    "\n",
    "[OpenVINO™](https://github.com/openvinotoolkit/openvino) is an open-source toolkit for optimizing and deploying AI inference. OpenVINO™ Runtime can enable running the same model optimized across various hardware [devices](https://github.com/openvinotoolkit/openvino?tab=readme-ov-file#supported-hardware-matrix). Accelerate your deep learning performance across use cases like: language + LLMs, computer vision, automatic speech recognition, and more.\n",
    "\n",
    "OpenVINO models can be run locally through `OpenVINOLLM` entitiy wrapped by LlamaIndex :"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90cf0f2e-8d8d-4e42-81bf-866c759221e1",
   "metadata": {},
   "source": [
    "In the below line, we install the packages necessary for this demo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f413f179",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openvino transformers huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dac8f9f-7136-43f7-9e9f-de679e74d66e",
   "metadata": {},
   "source": [
    "Now that we're set up, let's play around:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2c577674",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86028752",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0465029c-fe69-454a-9561-55f7a382b2e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms.openvino import OpenVINOLLM"
   ]
  },
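  {
   "cell_type": "markdown",
   "id": "f8a31c72",
   "metadata": {},
   "source": [
    "The model used below, `HuggingFaceH4/zephyr-7b-beta`, expects prompts in the Zephyr chat format, with `<|system|>`, `<|user|>` and `<|assistant|>` tags. The two helpers below convert LlamaIndex chat messages and plain completion strings into that format:"
   ]
  },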
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49122583",
   "metadata": {},
   "outputs": [],
   "source": [
    "def messages_to_prompt(messages):\n",
    "    prompt = \"\"\n",
    "    for message in messages:\n",
    "        if message.role == \"system\":\n",
    "            prompt += f\"<|system|>\\n{message.content}</s>\\n\"\n",
    "        elif message.role == \"user\":\n",
    "            prompt += f\"<|user|>\\n{message.content}</s>\\n\"\n",
    "        elif message.role == \"assistant\":\n",
    "            prompt += f\"<|assistant|>\\n{message.content}</s>\\n\"\n",
    "\n",
    "    # ensure we start with a system prompt, insert blank if needed\n",
    "    if not prompt.startswith(\"<|system|>\\n\"):\n",
    "        prompt = \"<|system|>\\n</s>\\n\" + prompt\n",
    "\n",
    "    # add final assistant prompt\n",
    "    prompt = prompt + \"<|assistant|>\\n\"\n",
    "\n",
    "    return prompt\n",
    "\n",
    "\n",
    "def completion_to_prompt(completion):\n",
    "    return f\"<|system|>\\n</s>\\n<|user|>\\n{completion}</s>\\n<|assistant|>\\n\""
   ]
  },
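  {
   "cell_type": "markdown",
   "id": "2b9d4e51",
   "metadata": {},
   "source": [
    "For example, `completion_to_prompt` wraps a plain completion string in the full chat template:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c1f0a83",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: inspect the Zephyr-formatted prompt the helper produces\n",
    "print(completion_to_prompt(\"What is OpenVINO?\"))"
   ]
  },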
  {
   "cell_type": "markdown",
   "id": "d3e21cef-b3c3-4ddd-a70c-728de440648e",
   "metadata": {},
   "source": [
    "### Model Loading\n",
    "\n",
    "Models can be loaded by specifying the model parameters using the `OpenVINOLLM` method.\n",
    "\n",
    "If you have an Intel GPU, you can specify `device_map=\"gpu\"` to run inference on it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a27feba3-d027-4d10-b1af-1e130e764a67",
   "metadata": {},
   "outputs": [],
   "source": [
    "ov_config = {\n",
    "    \"PERFORMANCE_HINT\": \"LATENCY\",\n",
    "    \"NUM_STREAMS\": \"1\",\n",
    "    \"CACHE_DIR\": \"\",\n",
    "}\n",
    "\n",
    "ov_llm = OpenVINOLLM(\n",
    "    model_id_or_path=\"HuggingFaceH4/zephyr-7b-beta\",\n",
    "    context_window=3900,\n",
    "    max_new_tokens=256,\n",
    "    model_kwargs={\"ov_config\": ov_config},\n",
    "    generate_kwargs={\"temperature\": 0.7, \"top_k\": 50, \"top_p\": 0.95},\n",
    "    messages_to_prompt=messages_to_prompt,\n",
    "    completion_to_prompt=completion_to_prompt,\n",
    "    device_map=\"cpu\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e25c7162",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = ov_llm.complete(\"What is the meaning of life?\")\n",
    "print(str(response))"
   ]
  },
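  {
   "cell_type": "markdown",
   "id": "9a7e5d24",
   "metadata": {},
   "source": [
    "Since `OpenVINOLLM` implements the standard LlamaIndex LLM interface, you can also use the `chat` endpoint with a list of `ChatMessage` objects (a minimal sketch; the messages are illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d8b2f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.llms import ChatMessage\n",
    "\n",
    "messages = [\n",
    "    ChatMessage(role=\"system\", content=\"You are a helpful assistant\"),\n",
    "    ChatMessage(role=\"user\", content=\"What is OpenVINO?\"),\n",
    "]\n",
    "\n",
    "print(ov_llm.chat(messages))"
   ]
  },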
  {
   "cell_type": "markdown",
   "id": "072dd59e-e3e7-41b9-b6fb-07bb41a82d2c",
   "metadata": {},
   "source": [
    "### Inference with local OpenVINO model\n",
    "\n",
    "It is possible to [export your model](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#export) to the OpenVINO IR format with the CLI, and load the model from local folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41dfc1a1-0aea-4136-a194-0428b89dc3cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e7683ab-66ae-4fbc-af20-6e3ec524d28a",
   "metadata": {},
   "source": [
    "It is recommended to apply 8 or 4-bit weight quantization to reduce inference latency and model footprint using `--weight-format`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92d69e87",
   "metadata": {},
   "outputs": [],
   "source": [
    "!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9e96c9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6982d98",
   "metadata": {},
   "outputs": [],
   "source": [
    "ov_llm = OpenVINOLLM(\n",
    "    model_id_or_path=\"ov_model_dir\",\n",
    "    context_window=3900,\n",
    "    max_new_tokens=256,\n",
    "    model_kwargs={\"ov_config\": ov_config},\n",
    "    generate_kwargs={\"temperature\": 0.7, \"top_k\": 50, \"top_p\": 0.95},\n",
    "    messages_to_prompt=messages_to_prompt,\n",
    "    completion_to_prompt=completion_to_prompt,\n",
    "    device_map=\"gpu\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e26a478-2974-4ced-89e8-c13a64a409b2",
   "metadata": {},
   "source": [
    "You can get additional inference speed improvement with Dynamic Quantization of activations and KV-cache quantization. These options can be enabled with `ov_config` as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01c89828-a94b-4242-baf0-204eb0f1c87a",
   "metadata": {},
   "outputs": [],
   "source": [
    "ov_config = {\n",
    "    \"KV_CACHE_PRECISION\": \"u8\",\n",
    "    \"DYNAMIC_QUANTIZATION_GROUP_SIZE\": \"32\",\n",
    "    \"PERFORMANCE_HINT\": \"LATENCY\",\n",
    "    \"NUM_STREAMS\": \"1\",\n",
    "    \"CACHE_DIR\": \"\",\n",
    "}"
   ]
  },
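  {
   "cell_type": "markdown",
   "id": "e3c6a917",
   "metadata": {},
   "source": [
    "The updated `ov_config` is applied the same way as before, by passing it through `model_kwargs` when recreating the LLM (a sketch reusing the local `ov_model_dir` export from above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5f29c48",
   "metadata": {},
   "outputs": [],
   "source": [
    "ov_llm = OpenVINOLLM(\n",
    "    model_id_or_path=\"ov_model_dir\",\n",
    "    context_window=3900,\n",
    "    max_new_tokens=256,\n",
    "    model_kwargs={\"ov_config\": ov_config},\n",
    "    generate_kwargs={\"temperature\": 0.7, \"top_k\": 50, \"top_p\": 0.95},\n",
    "    messages_to_prompt=messages_to_prompt,\n",
    "    completion_to_prompt=completion_to_prompt,\n",
    "    device_map=\"gpu\",\n",
    ")"
   ]
  },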
  {
   "cell_type": "markdown",
   "id": "dda1be10",
   "metadata": {},
   "source": [
    "### Streaming\n",
    "\n",
    "Using `stream_complete` endpoint "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12e0f3c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = ov_llm.stream_complete(\"Who is Paul Graham?\")\n",
    "for r in response:\n",
    "    print(r.delta, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c87c383",
   "metadata": {},
   "source": [
    "Using `stream_chat` endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2db801a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.llms import ChatMessage\n",
    "\n",
    "messages = [\n",
    "    ChatMessage(\n",
    "        role=\"system\", content=\"You are a pirate with a colorful personality\"\n",
    "    ),\n",
    "    ChatMessage(role=\"user\", content=\"What is your name\"),\n",
    "]\n",
    "resp = ov_llm.stream_chat(messages)\n",
    "\n",
    "for r in resp:\n",
    "    print(r.delta, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fa723d6-4308-4d94-9609-8c51ce8184c3",
   "metadata": {},
   "source": [
    "For more information refer to:\n",
    "\n",
    "* [OpenVINO LLM guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).\n",
    "\n",
    "* [OpenVINO Documentation](https://docs.openvino.ai/2024/home.html).\n",
    "\n",
    "* [OpenVINO Get Started Guide](https://www.intel.com/content/www/us/en/content-details/819067/openvino-get-started-guide.html).\n",
    "\n",
    "* [RAG example with LlamaIndex](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-rag-llamaindex)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
